Topics

  • Get a brief introduction to Bayesian statistics and understand how it differs from the frequentist point of view.
  • Understand the difference between a credible interval and a confidence interval, particularly in how they can be interpreted.
  • Understand how to create a Bayesian simple linear regression model with the brms package using different priors.
  • Understand how to create a Bayesian multivariate linear regression model with the brms package using different priors.

Installations and Data

This tutorial will make use of the following R libraries.

library(brms)
library(patchwork)
library(pander)
library(broom.mixed)
library(tidyverse)
library(bayesplot)
library(latex2exp)

This tutorial will be using the nhanes dataset where the variables are described in the file nhanes-codebook.txt. This data can be loaded with the load function, specifying the rda data file.

load(file = 'nhanes1518.rda')

Introduction to Bayesian Statistics

Beliefs and Priors

There are two schools of thought in the world of statistics: frequentist, which we have been working with up to this point, and Bayesian. In the frequentist school, probability is defined as a frequency. For example, if we roll a fair six-sided die, the probability we will roll a 1 is \(P(Y = 1) = \frac{1}{6}\). The conclusions we make in frequentist statistics are based on likelihood; we retrieve data and then make an inference based on the data.

In the Bayesian school, probability is defined as a belief. For example, a Bayesian statement could be “I believe the probability that it rains tomorrow is 30%.” Different people obviously have different beliefs, so another statistician stating “I believe the probability that it rains tomorrow is 40%” is just as valid. Like frequentist statistics, our conclusions rely on likelihood. However, unlike frequentist statistics, our conclusions combine the likelihood with another factor: our prior knowledge. When we collect data, we then combine our prior knowledge and our likelihood to update our prior knowledge. For simplicity, we can simply refer to our prior knowledge that we wish to update as the prior. After we see our data, we combine our likelihood with our prior to create the posterior.

It is important to understand that different statisticians can have different priors for the same data. For example, a statistician who lives in the desert may believe that the probability that it rains tomorrow is 5%. However, if they move to the rainforest, their belief on future rain chances will change dramatically. This example analogizes how we combine our prior knowledge (living in the desert leads to a low chance for rain) with our data (the amount of rain after we move to the rainforest) to obtain our posterior (living in the rainforest leads to a high chance for rain).

A statistician can also have no prior knowledge. Suppose, unrealistically, that a statistician has no idea how likely it is to rain on any given day. When that statistician is asked for the probability that it will rain tomorrow, they have no beliefs to go off of. This statistician can then use an uninformative prior, or one that tells us little about the situation at hand, to describe rain chances. A common example of an uninformative prior is a \(Uniform(0, 1)\) distribution, which would mean the probability that it rains tomorrow has an equal chance of being any value from 0 to 1.

But how exactly do we get from the prior and the likelihood to the posterior?

Bayes’ Rule

The fundamental concept of Bayesian statistics is the fittingly-named Bayes’ rule. Given a likelihood function \(p(y | \theta)\) and a prior \(p(\theta)\), the posterior \(p(\theta | y)\) stated by Bayes’ rule is \(p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)}\). \(p(y)\) is the marginal likelihood, which we can get by integrating out the \(theta\) from the likelihood function:

\[ p(y) = \int p(y|\theta)p(\theta)d\theta \]

For example, we can use the Bayes’ rule with our nhanes dataset to estimate the mean weight (BMXWT) of the population. We may assume that the weights in our data are independent and identically distributed from a Normal distribution \(y_i | \mu\sim \mathcal{N}(\mu, \sigma^2)\). We also may have some prior knowledge of the mean weight \(\mu\); let’s say our belief is that \(\mu\) also follows a normal distribution, \(\mu ~ \sim \mathcal{N}(\mu_0, \sigma_0^2)\). By plugging in our likelihood function \(p(y|\mu)\) and our prior \(p(\mu)\) and doing some algebra and calculus, we would obtain a posterior \(p(\mu|y)\):

\[ \mu | y \sim \mathcal{N}\left(\frac{\sigma_0^2}{\frac{\sigma^2}{n}+\sigma_0^2} \space \bar{y} \space + \frac{\sigma^2}{\frac{\sigma^2}{n}+\sigma_0^2} \space \mu_0 \space, \space \left(\frac{1}{\sigma_0^2} \space + \space \frac{n}{\sigma^2}\right)^{-1}\right) \]

We can see that the posterior mean (the first parameter of the normal distribution above) is a weighted sum of the prior mean \(\mu_0\) and the sample mean \(\bar{y}\). This helps illustrate how the Bayes’ rule combines the information from our data and from our prior in order to form the posterior.

This is a rather simple example, since the function for \(p(y)\) ends up being a closed-form expression and we are working with common distributions. Particularly, this example makes use of a conjugate prior: this occurs when the prior, likelihood, and posterior all follow the same probability distribution. In this case, the prior and likelihood both came from a Normal distribution, and the posterior also ended up being normally distributed. Statisticians usually prefer using conjugate priors as it makes computation much simpler. However, conjugate priors may not be available or well-suited for every scenario. Under other priors, we may not be able to obtain a closed-form expression for \(p(y)\). In these situations where performing calculations proves to be extremely complex, we will not be able to apply the Bayes’ rule in order to retrieve the posterior. We would then have to use other sampling schemes–for example, Markov Chain Monte Carlo, which will be touched on later–in order to obtain the samples that follow the posterior distribution.

Posteriors & Inference

We have now discussed both the prior and the posterior, which correspond to the probability distributions of parameter \(\theta\) before and after we obtain the data respectively. In Bayesian statistics, \(\theta\) is a random variable or a quantity that can take on different values based on different outcomes of an experiment. Hence, \(\theta\) comes from a probability distribution. This is a key distinction from frequentist statistics, where \(\theta\) is a fixed value.

The different perspectives that frequentist and Bayesian statisticians have for what \(\theta\) is correspond to different methods for inference. For example, frequentists may obtain the maximum likelihood estimator for \(\theta\) and construct the confidence interval around this estimate. A limitation of this approach is that the probability is defined for the procedure of constructing the confidence interval, but it is not defined for a specific fixed sample. Thus, when interpreting a 95% confidence interval for example, we cannot say “there is a 95% chance that the confidence interval includes the true population parameter.” Instead, we say that if we drew many samples from the population and created a 95% confidence interval for each sample, the true population parameter will be found in 95% of the intervals.

In Bayesian statistics, we obtain the posterior probability distribution of \(\theta\) by applying Bayes’ rule to update our prior belief. We use the posterior distribution to make inference. For example, Bayesian statisticians usually use the mean of the posterior probability distribution as a point estimator. Then, they use quantiles of the posterior distribution to construct a credible interval, the Bayesian answer to a confidence interval. These have simpler interpretations.

To illustrate credible intervals, let’s assume that the posterior distribution of \(\theta\) is a \(Uniform(0, 1)\) distribution. This means \(\theta\) has an equal chance of being any value between 0 and 1. A 95% credible interval is between any lower bound \(L\) and any upper bound \(U\) such that the posterior probability of \(L < p < U = 0.95\). We use the quantiles of our posterior distribution for \(\theta\) to obtain \(L\) and \(U\). In the case of a 95% interval, these are the 0.025 and 0.975 quantiles (the middle 95% of the data). For a \(Uniform(0, 1)\) distribution, this means \(L = 0.025\) and \(U = 0.975\). We can then say: “There is a 95% probability that the true population parameter lies between 0.025 and 0.975, given the evidence provided by the observed data.”

Bayesian Simple Linear Regression

Notation, Parameters, and Assumptions

In statistics, we use linear regression to model the relationship between a response variable and one or more explanatory variables. Linear regression is done differently in frequentist statistics and Bayesian statistics.

In simple linear regression, we have one explanatory variable. Simple linear regression takes the form \(y = \beta_0 +\beta_1 x +\epsilon\), where \(y\) is the response variable, \(x\) is the explanatory variable, \(\beta_1\) is the true slope of the relationship between \(x\) and \(y\), \(\beta_0\) is the true intercept of the relationship between \(x\) and \(y\), and \(\epsilon\) is the random error term.

In frequentist statistics, linear regression uses point estimates for parameters \(\beta_0\) and \(\beta_1\) in order to help predict the value of the response variable based on the explanatory variable. The formula for Bayesian regression may appear to be the same, but now, our parameters are drawn from probability distributions. If we have a vague prior, we will have weak correlations between the independent and dependent variables. This explains the lack of steepness in the lines on the Bayesian graph in the visualization above. After we see the data and update our prior, we will begin to see stronger correlations between the variables.

The assumptions for Bayesian linear regression are the same for the frequentist method, being as follows:

  • linearity: The relationship between the response and explanatory variables is linear.
  • independence: The observations are independent of each other, meaning that the results for one observation are not influenced by the results of any other observation.
  • homoscedasticity: The variance of the errors is constant for all values of the explanatory variables.
  • normality: The errors are normally distributed.
  • no multicollinearity: The explanatory variables are not highly correlated with each other.
  • no autocorrelation: The errors are not correlated with each other.

It is important that these assumptions are not violated to preserve the validity of the regression model. Later, we will learn how to verify that our model fits these assumptions.

Motivation

As discussed earlier, we perform simple linear regression when we want to explore the relationship between one explanatory variable and one response variable. For example, we might want to better understand the relationship between body mass index and age for a variety of reasons–perhaps we want to know if different age groups are more prone to obesity. We can help answer this question with a simple linear regression model using the nhanes dataset.

For this linear model, we want to predict body mass index based on age; therefore, we will use BMXBMI as our response variable and RIDAGEYR as our explanatory variable. We will limit our analysis to adults.

nhanes1518 <- nhanes1518 %>% 
  filter(RIDAGEYR >= 18)

Model

We will use the brms library to create our Bayesian model. An important part of creating a Bayesian regression model is establishing our priors for the parameters. Based on our prior knowledge of the data, there is a variety of priors we can set (and thus a variety of models we can create). Later, we will discuss strategies for selecting model priors. For now, we will assume that we have no prior knowledge of the dataset, using a noninformative prior (the default prior in brms). We can use the brm function to create our model. The formula argument of this function follows the format y ~ x, with y being the response variable and x being the explanatory variable. We will save our model in a variable named model1.

model1 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518,
             seed    = 123)

Interpretation

Once we have created our model, we will then interpret its results. Our model gives us posterior distributions for \(\beta_{Intercept}\) and \(\beta_{Age}\); we can plot these with the mcmc_dens function of the bayesplot package, specifying the distributions we want to plot with the pars argument.

We can also use the summary function to view more detailed statistics of our model. We will set the priors argument to TRUE so that the summary includes the prior distributions we used for our model.

mcmc_dens(model1, pars = c("b_Intercept", "b_RIDAGEYR"))

summary(model1, priors = TRUE)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: BMXBMI ~ RIDAGEYR 
##    Data: nhanes1518 (Number of observations: 11096) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Priors: 
## Intercept ~ student_t(3, 28.4, 6.4)
## <lower=0> sigma ~ student_t(3, 0, 6.4)
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    28.66      0.20    28.28    29.04 1.00     4763     3278
## RIDAGEYR      0.02      0.00     0.01     0.03 1.00     5132     3187
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.26      0.05     7.17     7.36 1.00     3864     3165
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Similar to frequentists, we have point estimates for \(\beta_{Intercept}\) and \(\beta_{Age}\) that we are interested in. These are the means of the posterior distributions we plotted with the mcmc_dens function. The estimates can be found in the Intercept and BMXBMI rows of the summary output. These tell us that our estimate for the intercept is 28.66, and the estimate for our slope is 0.02. With these statistics, let’s now fill in our model:

\[\hat{BMI} = 28.66 + 0.02(Age)\]

We can interpret the estimate of the slope to mean that for every 1 year increase in age, we expect BMI to increase by 0.02 on average. The intercept does not have a meaningful interpretation in this case as it assumes a person with an age of 0, which is unrealistic.

Inference

We are also interested in the l-95% CI and u-95% CI columns of the summary output. These estimates are obtained from the credible intervals of the posterior distributions we plotted with mcmc_dens. We can thus state that there is a 95% chance that the true value for the intercept lies between 28.28 and 29.04, and there is a 95% chance that the true value for \(\beta_{Age}\) lies between 0.01 and 0.03.

Now that we have a credible interval for our slope, we might want to conclude whether there is a statistically significant relationship between BMI and age To do so, we can examine whether or not \(\hat{\beta_{Age}}\) is statistically different from 0. Our null hypothesis in this case is that the slope is equal to 0, while our alternative hypothesis is that the slope is not equal to 0 (which implies a linear relationship between BMI and age). We can answer this question with our credible interval. Since the interval for RIDAGEYR does not include 0, we can reject our null hypothesis and conclude that there is likely a linear relationship between age and body mass index (although it appears to be very small).

Prediction

Suppose we want to guess an individual’s BMI using our model. We can do this with a prediction interval; it differs from using normal confidence intervals as confidence intervals only express sampling uncertainty from our data. On the other hand, prediction intervals express sampling uncertainty from our data and from the single value with which we want to predict. As a result, prediction intervals are wider than confidence intervals.

We can use the predict function with our model to predict individual’s BMIs given an age of 20, 40, and 60.

set.seed(123)
predict(model1,
        data.frame(RIDAGEYR = (c(20, 40, 60))),
        interval = "prediction") %>%
  knitr::kable(digits = 2)
Estimate Est.Error Q2.5 Q97.5
29.07 7.21 14.73 43.29
29.33 7.24 15.05 43.08
29.77 7.30 15.33 44.06

Our prediction interval estimates that a 20-year-old individual likely has a BMI between 14.73 and 43.29. As we can see, the interval is extremely wide, limiting our ability to make meaningful conclusions; this is likely because the relationship between BMI and age is weak.

Bayesian Multivariate Linear Regression

Motivation

Now that we’ve tackled simple linear regression, let’s learn how to do multivariate linear regression the Bayesian way. We do multivariate regression when we want to know how one variable is impacted by multiple other variables. We can also use this method when we want to control for some variable. For example, in our nhanes dataset, we might want to know how factors like age, waist size, weight, and height impact BMI; we might also want to know how waist size, weight, and height impact BMI if we control for age. To help answer these questions, we can fit a model with BMXBMI as the response and RIDAGEYR, BMXWAIST, BMXWT, and BMXHT as the predictors.

Model

Let’s first visualize the distribution of all of our predictor variables in the dataset.

p1 <- ggplot(nhanes1518, aes(x = RIDAGEYR)) + 
  geom_histogram(color = "black", fill = "red") +
  labs(title = 'Distribution of Age in Dataset')

p2 <- ggplot(nhanes1518, aes(x = BMXWAIST)) + 
  geom_histogram(color = "black", fill = "blue") +
  labs(title = 'Distribution of Waist Size in Dataset')

p3 <- ggplot(nhanes1518, aes(x = BMXWT)) + 
  geom_histogram(color = "black", fill = "yellow") +
  labs(title = 'Distribution of Weight in Dataset')

p4 <- ggplot(nhanes1518, aes(x = BMXHT)) + 
  geom_histogram(color = "black", fill = "green") +
  labs(title = 'Distribution of Height in Dataset')

p1 + p2 + p3 + p4 + 
  plot_layout(ncol = 2)

As we can see from the x-axis labels, our predictors are not measured in the same way; age is in years, waist and height are in centimeters, and weight is in kilograms. This leads to these predictors having different ranges and, therefore, different variances. If we used this data as is, we would have to specify different variances for the prior distributions of each predictor. For simplicity purposes, we can use the scale function in R to normalize our data so that each predictor has a mean of 0 and a standard deviation of 1. This will allow us to specify the same prior for each of our predictor variables. Note that, in this example, scaling our data is for convenience; in real-world scenarios, this may not be the best approach.

nhanes1518['RIDAGEYR_sc'] <- scale(nhanes1518$RIDAGEYR)
nhanes1518['BMXWAIST_sc'] <- scale(nhanes1518$BMXWAIST)
nhanes1518['BMXWT_sc'] <- scale(nhanes1518$BMXWT)
nhanes1518['BMXHT_sc'] <- scale(nhanes1518$BMXHT)

Again, we will begin by assuming we have no prior knowledge of the data and use the default the prior in brms. We will again use the brm function to create our model. The format for the formula argument here is y ~ x1 + x2 + .. + xn, where each \(x_i\) is an explanatory variable.

multi_model1 <- brm(formula = BMXBMI ~ RIDAGEYR_sc + BMXWAIST_sc + BMXHT_sc + BMXWT_sc, 
             data    = nhanes1518,
             seed    = 123)

Interpretation

We can once again interpret our model’s results with the summary function.

summary(multi_model1, priors = TRUE)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: BMXBMI ~ RIDAGEYR_sc + BMXWAIST_sc + BMXHT_sc + BMXWT_sc 
##    Data: nhanes1518 (Number of observations: 10534) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Priors: 
## Intercept ~ student_t(3, 28.3, 6.4)
## <lower=0> sigma ~ student_t(3, 0, 6.4)
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      29.53      0.01    29.51    29.54 1.00     4105     2265
## RIDAGEYR_sc    -0.07      0.01    -0.09    -0.06 1.00     2659     2529
## BMXWAIST_sc     0.28      0.03     0.23     0.34 1.00     1692     1922
## BMXHT_sc       -3.49      0.01    -3.51    -3.47 1.00     2127     2551
## BMXWT_sc        7.70      0.03     7.65     7.76 1.00     1627     1881
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.84      0.01     0.83     0.85 1.00     4902     2819
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

With these statistics, we can now fill in our model:

\[\hat{BMI} = 29.53 - 0.07(Age\_sc) + 0.28(Waist\_sc) - 3.49(Height\_sc) + 7.70(Weight\_sc)\]

We can see that the \(\hat{\beta_{Height_sc}}\) and \(\hat{\beta_{Weight_sc}}\) are the largest in magnitude. This means that for a one unit increase in scaled height or scaled weight, we expect a larger change in BMI compared to a one unit increase in scaled age or scaled waist size. This makes intuitive sense, as BMI is calculated from height and weight.

We can interpret the estimate of the slope of age to mean that for every 1 unit increase in scaled age, we expect BMI to decrease by 0.07 on average, holding all other factors constant. Since we scaled our data, we also have a meaningful interpretation for our intercept: we expect a person with the average age, waist size, height, and weight to have a BMI of 29.53.

Inference

We can again look at the l-95% CI and u-95% CI columns to obtain 95% credible intervals for our intercept and estimates. Since all of the intervals do not include 0, we can reject our null hypothesis for all variables and conclude that, holding all other factors constant, there is likely a linear relationship between height and BMI, weight and BMI, age and BMI, and waist size and BMI.

Prior Selection

Introduction

So far in this tutorial, we have used the default priors for our linear regression models. However, one of the foundations of Bayesian statistics is setting an informative prior, or one that tells us something about the data.

Prior selection is especially important when we do not have a lot of data. In these cases, since there is not much data to go off of, much of our inference comes from the prior. This means it is vital that we select a fitting distribution, or we risk an inaccurate posterior.

Different priors have different motivations and properties. Some scrupulous scientists want to be as objective as possible when choosing a prior, while others may aim to choose a prior that satisfies experts with different opinions. This section will discuss common priors used for coefficients (\(\beta\)) and variances (\(\sigma^2\)).

Since we have thousands of observations in our dataset, the different priors we may set would yield similar results. To better demonstrate how choosing different priors can impact our model, let’s take a sample of our dataset of size n = 30.

set.seed(456)
nhanes1518_sample <- sample_n(nhanes1518, 30)

Now, using our sampled data, let’s see the results of a model using default priors.

default_prior <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             seed    = 123)
summary(default_prior, priors = TRUE)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: BMXBMI ~ RIDAGEYR 
##    Data: nhanes1518_sample (Number of observations: 26) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Priors: 
## Intercept ~ student_t(3, 27.5, 5.5)
## <lower=0> sigma ~ student_t(3, 0, 5.5)
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    36.19      5.43    25.57    47.06 1.00     3558     2810
## RIDAGEYR     -0.12      0.10    -0.31     0.07 1.00     3421     2435
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     9.40      1.35     7.19    12.51 1.00     3767     2878
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

We get the model \(\hat{BMI} = 36.19 - 0.12(Age)\). We will compare this to models using specified priors.

Coefficient Priors: The Normal Distribution

A common prior for the coefficient (or \(\beta\)) in Bayesian inference deals with a normal (Gaussian) distribution. This is a conjugate prior, meaning that the posterior will also be a normal distribution, but with different parameters. This might be a good choice for our beta prior if we assume our data is normally distributed, meaning most values are situated in the middle, and there is a symmetrical decline in the frequency of values as we reach the ends.

We will use the same BMXBMI ~ RIDAGEYR model we worked with in the single linear regression section. Let’s assume the BMI values in our sample are normally distributed. We will choose a normal distribution with \(\mu = 0\) and \(\sigma = 0.3\) for \(\beta_{Age}\). We can use the prior argument in the brm function to set our prior for \(\beta\). In this argument, we set class = "b" and coef = "RIDAGEYR" to signify that this is a prior on \(\beta_{Age}\).

model2 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("normal(0, 0.3)", class = "b", coef = "RIDAGEYR")),
             seed    = 123)

Now, let’s use the mcmc_intervals function of the bayesplot package to compare the values of \(\beta_{Age}\) for each of our two models so far:

(mcmc_intervals(
  default_prior,
  point_est = "mean",
  prob_outer = 0.95,
  pars = c("b_RIDAGEYR")) +
   coord_cartesian(xlim = c(-0.35, 0.1)) +
   labs(title = "Default Coefficient Prior") +
   theme_minimal()
 ) /
  (mcmc_intervals(
    model2,
    point_est = "mean",
    pars = c("b_RIDAGEYR")) +
     coord_cartesian(xlim = c(-0.35, 0.1)) +
     labs(title = TeX("Normal(0, $0.03^2$) Coefficient Prior")) +
     theme_minimal()
   )

The thick line represents a 50% confidence interval, while the thin line represents a 95% confidence interval. As we can see, both models have similar point estimates for the mean of \(\beta_{Age}\), but the intervals differ in length. Our \(N(0, 0.3^2)\) model has a narrower interval, likely because we supplied a prior with a small variance. This signifies how different priors can yield different posteriors.

Let’s run another model with a normally distributed \(\beta\) prior, using the same mean but inputting a much larger variance.

model3 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("normal(0, 100)", class = "b", coef = "RIDAGEYR")),
             seed    = 123)
mcmc_intervals(
  model3,
  point_est = "mean",
  prob_outer = 0.95,
  pars = c("b_RIDAGEYR")) +
   coord_cartesian(xlim = c(-0.35, 0.1)) +
   labs(title = TeX("Normal(0, $100^2$) Coefficient Prior")) +
   theme_minimal()

The results of this model more closely resemble our default model as compared to our model with the \(N(0, 0.03^2)\) prior. This is because our extremely large variance means we are very unsure about the value of \(\beta_{Age}\); thus, this prior is largely noninformative, similar to the default.

Let’s recap the models we’ve created so far with their results:

Coefficient Priors: The T-Distribution

Another common prior for the coefficient in Bayesian inference deals with the student t-distribution. This differs from the normal distribution in that the values are less clustered in the middle, giving it heavier tails on the ends. Using the t-distribution can offer more robust estimation; it is often used on smaller, approximately normal datasets (for example, when n < 30). For the purpose of this section, we will proceed using a t-distribution to estimate \(\beta_{Age}\).

We can change the shape of our t-distribution by altering the degrees of freedom; the more degrees of freedom we have, the more we will resemble a normal distribution (so, the less heavy the tail ends will be). Let’s fit a model with a beta prior distributed as a unit t-distribution with 3 degrees of freedom:

model4 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("student_t(3, 0, 1)", class = "b", coef = "RIDAGEYR")),
             seed    = 123)
mcmc_intervals(
  model4,
  point_est = "mean",
  prob_outer = 0.95,
  pars = c("b_RIDAGEYR")) +
   coord_cartesian(xlim = c(-0.35, 0.1)) +
   labs(title = TeX("Student_T(3, 0, 1) Coefficient Prior")) +
   theme_minimal()

Again, our model results resemble those we have previously calculated. For demonstrative purposes, let’s also run a model with more degrees of freedom (a more Gaussian-like distribution).

model5 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("student_t(10000, 0, 1)", class = "b", coef = "RIDAGEYR")),
             seed    = 123)
mcmc_intervals(
  model5,
  point_est = "mean",
  prob_outer = 0.95,
  pars = c("b_RIDAGEYR")) +
   coord_cartesian(xlim = c(-0.35, 0.1)) +
   labs(title = TeX("Student_T(10000, 0, 1) Coefficient Prior")) +
   theme_minimal()

And now let’s once again recap our models:

Upon selecting our priors and running our models, we can then use our posterior information to conduct further regression. Usually, regardless of the initial choice of prior, results will end up converging to the same posterior after many iterations. In this case, since our sample size is now very small (n = 30), it might be more intelligent to start with a student-t distribution for our \(\beta_{Age}\) parameter.

Variance Priors: The Inverse-Gamma

Now let’s move on to our variance (or \(\sigma^2\)) parameter. A common prior for the variance in Bayesian inference deals with the inverse gamma distribution. This particular distribution restricts the variance to only positive values (we cannot have a negative variance as it is the value of the standard deviation squared). The inverse gamma distribution takes two parameters, \(\alpha\) and \(\beta\), which impact the shape (different permutations of these parameters are pictured above). It is also a conjugate distribution, so our posterior will be an inverse gamma distribution.

Let’s run two models with inverse-gamma variance priors, using the parameters of the light blue line and the red line in the visualization above. We will use our nhanes1518_sample dataset that we created in the previous section.

model6 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("inv_gamma(3, 0.5)", class = "sigma")),
             seed    = 123)
model7 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("inv_gamma(1, 1)", class = "sigma")),
             seed    = 123)

Variance Priors: The Half-Cauchy Distribution

Another common prior for the variance in Bayesian inference deals with the Half-Cauchy distribution. The Cauchy distribution is equivalent to a student t-distribution with 1 degree of freedom. We then restrict this to only contain positive values (as again, our variance cannot be negative) to get the Half-Cauchy distribution. The Half-Cauchy takes two parameters, \(\mu\) and \(\sigma\), which impact the shape (two permutations of these parameters are pictured above). Similar to the student-t distribution, the Half-Cauchy has a heavy tail and provides more robust estimation.

Let’s run two models with Half-Cauchy variance priors, using the parameters of the two examples pictured above. We will continue to use our nhanes1518_sample dataset. If we specify our sigma prior to be a cauchy, the brms package will automatically restrict it to positive values, thus effectively making it a Half-Cauchy.

model8 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("cauchy(0, 0.1)", class = "sigma")),
             seed    = 123)
model9 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             prior   = c(set_prior("cauchy(0, 1)", class = "sigma")),
             seed    = 123)

Let’s visualize the results from all of our variance models with the mcmc_intervals function

Notice specifically how the model with the \(InvGam(3, 0.5)\) prior for the variance sticks out from the rest in terms of posterior \(\sigma\). This is likely because of the prior’s large \(\alpha\) parameter, which forces the prior to very sharply fit the data. Meanwhile, the model with the \(Half-Cauchy(0, 1)\) prior for the variance has the widest credible interval. This is likely because of the prior’s large \(\sigma\) parameter, which corresponds to the prior weakly fitting the data.

Priors for Multivariate Regression

When we construct a multivariate regression model, we have more parameters for which we can set a prior. For this section, we will construct the same model as in the section on multivariate regression, with BMI as the response and age, waist size, weight, and height as the predictors. However, we will now use the nhanes1518_sample dataset. We first need to scale our variables.

nhanes1518_sample['RIDAGEYR_sc'] <- scale(nhanes1518_sample$RIDAGEYR)
nhanes1518_sample['BMXWAIST_sc'] <- scale(nhanes1518_sample$BMXWAIST)
nhanes1518_sample['BMXWT_sc'] <- scale(nhanes1518_sample$BMXWT)
nhanes1518_sample['BMXHT_sc'] <- scale(nhanes1518_sample$BMXHT)

Now, let’s create a model with the default parameters, which we will compare to a model with specified priors.

default_multi <- brm(formula = BMXBMI ~ RIDAGEYR_sc + BMXWAIST_sc + BMXHT_sc + BMXWT_sc, 
             data    = nhanes1518_sample,
             seed    = 123)

In brms, we can set the prior for each coefficient separately. For this model, we will use a normal distribution prior for each \(\beta\) with a mean of 0. Since we don’t know much about the data, we want our prior to be relatively flat; thus we will use a relatively large variance of 100. We will also use a simple \(HalfCauchy(0, 1)\) prior for sigma, and a very flat \(N(0, 500)\) prior for the intercept.

After we create our model, we will compare the results of the two models with mcmc_intervals.

priors2 <- c(set_prior("normal(0, 10)", class = "b", coef = 'RIDAGEYR_sc'),
             set_prior("normal(0, 10)", class = "b", coef = 'BMXWT_sc'),
             set_prior("normal(0, 10)", class = "b", coef = 'BMXWAIST_sc'),
             set_prior("normal(0, 10)", class = "b", coef = 'BMXHT_sc'),
             set_prior("cauchy(1, 1)", class = "sigma"),
             set_prior("normal(0, 22.3607)", class = "Intercept"))

multi_model2 <- brm(formula = BMXBMI ~ RIDAGEYR_sc + BMXWAIST_sc + BMXHT_sc + BMXWT_sc, 
             data    = nhanes1518_sample,
             prior   = priors2,
             seed    = 123)

As we can see, the results of both models are very similar. This makes sense, as all of our specified priors in the second model were very unspecific (due to large variances). If we fit sharper priors, we could expect to see a larger difference in the posterior from the default model.

Conclusion

Prior selection is a pivotal step of the Bayesian process. An informative prior can help make up for a lack of available data, allowing us to still obtain a posterior from which we can make meaningful interpretations and inferences.

This section goes over two common priors for both \(\beta\) and \(\sigma^2\) values. Using a normal distribution for \(\beta\) and an inverse-gamma distribution for \(\sigma^2\) can be a good choice if we believe most of our data is centered around the mean. However, Bayesian statisticians tend to use a t-distribution for \(\beta\) and a Half-Cauchy distribution for \(\sigma^2\) as these are insensitive to outliers. Regardless, there is no predefined “best answer” on what prior to use in any given situation; it depends on our knowledge of the information.

Model Evaluation

In the section on single linear regression, we mentioned a list of assumptions for Bayesian linear regression that help ensure the validity of our model. First, we need to check the assumptions for a linear model using residual plots, which is done in the same way in both frequentist and Bayesian statistics. The module on frequentist linear regression discusses how to go about checking these assumptions.

In addition to these checks, there also exists some special diagnostics for Bayesian models. Most of the time, we usually cannot determine the exact posterior distribution from our model, but we can use MCMC sampling to draw samples from the posterior distribution. Therefore, we need to check MCMC diagnostics to ensure that the MCMC samples we obtain can accurately represent the true posterior distribution.

MCMC Diagnostics & Trace Plots

In order to see if our Bayesian analysis is working and performing well, we need to examine Markov Chain Monte Carlo (MCMC) diagnostics. In statistics, the Monte Carlo method refers to when we generate a sample of independent draws from the probability distribution of a random variable \(X\) in order to estimate a feature of that distribution. The Markov Chain Monte Carlo method is similar, but in this case the items in the generated sample are serially correlated rather than independent. During model evaluation, samples from our posterior distributions will be generated using the MCMC method; we then can use trace plots to examine if our model meets MCMC diagnostics. Trace plots are line charts showing time on the x-axis, and values from the sample draws on the y-axis. We can use the plot() function in the brms package to display trace plots.

Let’s first examine a model that passes MCMC diagnostics. We will continue to use the nhanes1518_sample dataset, and we will focus on simple linear regression models.

mcmc1 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             seed    = 123)
plot(mcmc1)

The plots on the right side are the trace plots. We want to see something that represents a fuzzy caterpillar; this means that the generated draws have low serial correlation and that the sample space is explored many times. All three of our trace plots for this model fit the criteria. We can also do summary(mcmc1) and look at the values in the Rhat column; a value of 1 means our Markov chains have converged sufficiently, which is what we are looking for.

Now, let’s examine some models that fail MCMC diagnostics. Let’s run the same model as the previous example, but set the iter parameter to equal 50 (brms defaults this parameter to 2000).

mcmc2 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             seed    = 123,
             iter    = 50)
plot(mcmc2)

Here, we see our trace plots look vastly different than the ones in the iter = 2000 example. With plots like this, we can infer that the generated draws have high serial correlations and that the chains are failing to explore the sample space as frequently. This means that the effective size of our sample is too small; we can fix this issue by increasing the number of iterations (in this case, changing the iter parameter when fitting our model).

For this next model, we will use the same structure as the previous two models, except we will set the iter parameter to 400 and the warmup to 10 (the warmup parameter defaults to iter/2).

mcmc3 <- brm(formula = BMXBMI ~ RIDAGEYR, 
             data    = nhanes1518_sample,
             seed    = 123,
             iter    = 400,
             warmup  = 10)
plot(mcmc3)

Here, we see that the first portions of the trace plots look very different than the rest; this means our chains are eventually converging to the target distribution (as we get to the ‘fuzzy caterpillar’ portion), but our initial distributions are largely different than the target. We can fix this by increasing the warmup parameter, which discards a number of the initial observations.

Autocorrelation

Trace plots allow us to examine the extent to which serial correlation exists among the draws from our model. We can look at this serial correlation more specifically by plotting the autocorrelation function (ACF) of our model.

Let’s plot the autocorrelation for mcmc1, the model that we just showed passes MCMC diagnostics. We can use the mcmc_acf function of the bayesplot package to do so, using the pars argument to specify which ACFs we want to see. Recall that an assumption for Bayesian linear regression is the presence of no autocorrelation; thus, we want the ACFs to quickly approach zero.

mcmc_acf(mcmc1, pars = c("b_Intercept", "b_RIDAGEYR", "sigma"))

The ACFs for this model all quickly approach zero, meaning this is a valid model in terms of the no autocorrelation assumption.

Let’s compare these plots to those for the mcmc3 model, which we just showed fails MCMC diagnostics.

mcmc_acf(mcmc3, pars = c("b_Intercept", "b_RIDAGEYR", "sigma"))

These plots take much longer to reach zero, indicating a high degree of serial correlation. We would likely want to refit this model.

Effective Sample Size

Autocorrelation is closely related to another indicator used in model evaluation: the effective sample size. This metric, often notated as \(n_{eff}\), estimates the total number of independent draws made from sampling the posterior distribution. Autocorrelation, of course, means that some sequential draws will not be independent. Thus, \(n_{eff}\) is usually smaller than the overall sample size \(N\).

We can examine effective sample size using the neff_ratio and mcmc_neff functions from the bayesplot package. neff_ratio calculates the ratio \(\frac{n_{eff}}{N}\) for each parameter specified in the pars argument. mcmc_neff then takes a vector of these ratios and plots them. Let’s continue to use mcmc1, our diagnostic-passing model, and plot its effective sample sizes. We add theme_minimal to make the plot easier to read.

neff_mcmc1 <- neff_ratio(mcmc1, pars = c("b_Intercept", "b_RIDAGEYR", "sigma"))
mcmc_neff(neff_mcmc1) + theme_minimal()

This plot shows that the effective sample sizes for the parameters of this model are all close to the overall sample sizes. We want to see ratios closer to 1 (indicated by the lightest blue lines), as these indicate the lowest levels of autocorrelation.

Let’s compare these results to those for mcmc3, one of our diagnostic-failing models.

neff_mcmc3 <- neff_ratio(mcmc3, pars = c("b_Intercept", "b_RIDAGEYR", "sigma"))
mcmc_neff(neff_mcmc3) + theme_minimal()

As we can see, the effective sample size ratios are much lower for this model, as indicated by darker blue lines. This means there are very little amounts of independent draws from the posterior distributions, indicating high levels of autocorrelation. As concluded earlier, we would want to refit this model.

Model Comparison

Suppose we have multiple Bayesian linear regression models, and we desire to know which one is “better”. To accomplish this, we can do model comparison to determine which of our models minimizes prediction error.

Widely Applicable Information Criterion (WAIC)

One metric we can use to compare Bayesian models is the Widely Applicable Information Criterion (WAIC), which is an extension of the AIC metric. This statistic is a measurement of prediction error, and therefore we want to choose the model with the smallest WAIC.

Let’s compare the three models we just used to evaluate MCMC diagnostics. We can visualize their WAIC values, which are calculated using the waic() method in the brms package.

waic1 <- waic(mcmc1)$waic
waic2 <- waic(mcmc2)$waic
waic3 <- waic(mcmc3)$waic
wx <- c(waic1, waic2, waic3)
barplot(wx, ylab = "WAIC", names.arg=paste("mcmc", 1:3))

We see that the WAIC for mcmc1 is the lowest. This makes intuitive sense, as this was the model that passed MCMC diagnostics. If we were comparing these three models, we would proceed with mcmc1.

We can also use WAIC to determine what are the best priors to select for our data. Let’s take a look at the first five models we fitted in this module, using the nhanes1518_sample dataset. Here’s a reminder of their prior distributions and their posterior statistics:

\(\beta_{Age}\) Distributions Posterior Means Intercepts
Default -0.12 36.19
N(0, 0.09) -0.11 35.77
N(0, 10000) -0.12 36.23
StudentT(3, 0, 1) -0.12 36.18
StudentT(10000, 0, 1) -0.12 36.27
waic_default <- waic(default_prior)$waic
waic_mod2 <- waic(model2)$waic
waic_mod3 <- waic(model3)$waic
waic_mod4 <- waic(model4)$waic
waic_mod5 <- waic(model5)$waic
wy <- c(waic_default, waic_mod2, waic_mod3, waic_mod4, waic_mod5)
barplot(wy, ylab = "WAIC", ylim = c(194, 196),
        names.arg = c("default", paste("model", 2:5)))

Although all of the models have very similar WAIC scores, model 4 (which uses a \(StudentT(3, 0, 1)\) prior for \(\beta_{Age}\)) slightly edges out the others. We would proceed with model 4. This makes sense as, since our sample size is very small, it is more fitting to use a student-t distribution than a normal distribution for our prior on \(\beta\).